Case: Collaboration in Ecology

In this notebook we will use Tethne to examine the evolution of ecological research following WWII, through the lens of the journal Ecology. In particular, we will use a variety of network models to examine the development of collaborations and patterns of production.

This notebook assumes that you have already completed Notebook 1 (Working with Data from the Web of Science).

Download the dataset for this notebook here: https://www.dropbox.com/s/t789uk7fv8ce6ze/ecology_complete.txt.zip?dl=0

This command tells IPython to render graphics inline in the notebook (instead of in a new window).


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

First, we import the wos module from tethne.readers. We'll use this module to parse our data.


In [3]:
from tethne.readers import wos

We parse our Web of Science data using the read() function. This generates a new Corpus from our data. It may take several minutes, depending on your computer's resources.


In [4]:
corpus = wos.read('/Users/erickpeirson/Desktop/courseprep/ecology (1945-2012)/ecology_complete.txt')

The plot below shows the number of papers published in Ecology each year.


In [ ]:
plt.plot(*corpus.distribution())
plt.show()

Let's zoom in on the first 25 years of publication.

We first create a subcorpus for the years 1920-1944 (in Python, the end value of a range is excluded). We can then plot the volume of papers, as before.


In [9]:
corpus_early = corpus.subcorpus(('date', range(1920, 1945)))
plt.bar(*corpus_early.distribution())
plt.show()


Similarly, we will demarcate two further periods: 1945-1969 and 1970-1994.


In [10]:
corpus_mid = corpus.subcorpus(('date', range(1945, 1970)))
corpus_late = corpus.subcorpus(('date', range(1970, 1995)))

In [11]:
plt.figure(figsize=(15, 5))

plt.subplot(131)
plt.title('Early')
plt.ylabel('Number of Papers per Year')
plt.bar(*corpus_early.distribution())

plt.subplot(132)
plt.title('Mid')
plt.bar(*corpus_mid.distribution())

plt.subplot(133)
plt.title('Late')
plt.bar(*corpus_late.distribution())

plt.show()


Number of authors per paper

In the early period, the vast majority of publications are single-author papers. We can use the hist() function to display a histogram of the number of authors per paper.


In [59]:
# Calculate the number of authors per paper.
N_authors = [len(authors) for authors in corpus_early.features['authors'].features.values()]

# Generate a histogram.
plt.hist(N_authors, bins=range(1, 10))
plt.show()


Below we plot histograms of author count for all three periods, this time normalized so that we can compare them directly.


In [8]:
plt.hist([len(authors) for authors in corpus_early.features['authors'].features.values()],
         bins=range(1, 10), label='1920-1944', histtype='step', lw=4, normed=True)
plt.hist([len(authors) for authors in corpus_mid.features['authors'].features.values()],
         bins=range(1, 10), label='1945-1969', histtype='step', lw=4, normed=True)
plt.hist([len(authors) for authors in corpus_late.features['authors'].features.values()],
         bins=range(1, 10), label='1970-1994', histtype='step', lw=4, normed=True)

plt.legend()
plt.ylim(0, 1.)
plt.show()


There appears to be a slight increase in the proportion of two- and three-author papers in the mid period, and a decrease in the proportion of single-author papers. The trend toward two-, three-, and even four-author papers is even more pronounced in the late period, where single-author papers account for less than half of the overall output.
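To put a number on that last claim, we can compute the proportion of single-author papers in each period directly. A minimal sketch, reusing the subcorpora from above (output not shown):


In [ ]:
# Proportion of single-author papers in each of the three periods.
for label, subcorpus in [('1920-1944', corpus_early),
                         ('1945-1969', corpus_mid),
                         ('1970-1994', corpus_late)]:
    counts = [len(authors) for authors in subcorpus.features['authors'].features.values()]
    single = sum(1 for n in counts if n == 1)
    print label, float(single) / len(counts)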

Coauthorship

Co-authorship network models are popular tools for social network analysis of scientific fields. This is because the data used to generate those models are readily available, and because co-authorship does seem to capture something substantial about social ties among actors.


In [10]:
from tethne import coauthors    # We use the `coauthors()` function to create the network.

In [21]:
coauthor_graph_early = coauthors(corpus_early)

Number of nodes (order) and edges (size):


In [23]:
coauthor_graph_early.order(), coauthor_graph_early.size()


Out[23]:
(401, 323)

We can visualize the network in Cytoscape:


In [18]:
from tethne import write_graphml

In [14]:
write_graphml(coauthor_graph_early, '/Users/erickpeirson/Desktop/coauthors_early.graphml')

Let's do the same thing for the mid- and late-period subcorpora:


In [15]:
coauthor_graph_mid = coauthors(corpus_mid)
coauthor_graph_late = coauthors(corpus_late)

In [16]:
coauthor_graph_mid.order(), coauthor_graph_mid.size()


Out[16]:
(1511, 1343)

In [17]:
coauthor_graph_late.order(), coauthor_graph_late.size()


Out[17]:
(4151, 6221)

Size (number of edges) relative to order (number of nodes) gives us a rough sense of how densely connected each network is.


In [25]:
print '1920-1944:', float(coauthor_graph_early.size())/coauthor_graph_early.order()
print '1945-1969:', float(coauthor_graph_mid.size())/coauthor_graph_mid.order()
print '1970-1994:', float(coauthor_graph_late.size())/coauthor_graph_late.order()


1920-1944: 0.805486284289
1945-1969: 0.88881535407
1970-1994: 1.49867501807
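Note that this edge-to-node ratio is not a normalized density. For an undirected simple graph with n nodes and m edges, density in the strict sense is 2m / (n(n - 1)), the fraction of possible edges that are actually present. A quick sketch computing it for each period (output not shown):


In [ ]:
# Normalized density: fraction of all possible edges that are present.
for label, g in [('1920-1944', coauthor_graph_early),
                 ('1945-1969', coauthor_graph_mid),
                 ('1970-1994', coauthor_graph_late)]:
    n, m = g.order(), g.size()
    print label, 2.0 * m / (n * (n - 1))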

Let's visualize these as well:


In [27]:
write_graphml(coauthor_graph_mid, '/Users/erickpeirson/Desktop/coauthors_mid.graphml')
write_graphml(coauthor_graph_late, '/Users/erickpeirson/Desktop/coauthors_late.graphml')

Network analysis with NetworkX

Tethne uses a package called NetworkX to represent networks in Python. This means that we can use the algorithms provided by NetworkX to analyze our networks.


In [22]:
import networkx as nx

Let's start with something simple, like degree. The degree of a node is the number of other nodes that are connected to that node (i.e. its neighbors). NetworkX has a function called degree() that calculates the degree of all of the nodes in a network. Below, we calculate the average degree for each co-author network.


In [33]:
print '1920-1944:', mean(nx.degree(coauthor_graph_early).values())
print '1945-1969:', mean(nx.degree(coauthor_graph_mid).values())
print '1970-1994:', mean(nx.degree(coauthor_graph_late).values())


1920-1944: 1.61097256858
1945-1969: 1.77763070814
1970-1994: 2.99735003614

Clearly the average number of neighbors is increasing: authors are collaborating with more people.
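If we want to see which individuals anchor these collaborations, we can also rank authors by degree. A minimal sketch for the late-period network (output not shown), relying on the dict returned by nx.degree() in this version of NetworkX:


In [ ]:
# The five most-connected authors in the late-period network.
degree_late = nx.degree(coauthor_graph_late)    # dict: node -> number of co-authors
for node, degree in sorted(degree_late.items(), key=lambda item: item[1], reverse=True)[:5]:
    print node, degree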

Another interesting feature of these networks is the amount of clustering. Clustering occurs when a node's neighbors tend also to be connected to each other. The clustering coefficient can be calculated for individual nodes, but the average clustering coefficient gives us an indication of the degree of clustering globally. A low average clustering coefficient suggests that connections are spread more or less at random, rather than concentrated in tightly knit groups.


In [68]:
print '1920-1944:', nx.average_clustering(coauthor_graph_early)
print '1945-1969:', nx.average_clustering(coauthor_graph_mid)
print '1970-1994:', nx.average_clustering(coauthor_graph_late)


1920-1944: 0.234216960651
1945-1969: 0.332672202507
1970-1994: 0.49012091954

Not only do the actors in these networks tend to form more connections (degree) in later periods, but they also tend to form more clusters.
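For a finer-grained view, per-node clustering coefficients can be inspected directly with nx.clustering(). A minimal sketch for the late-period network (output not shown):


In [ ]:
# Distribution of per-node clustering coefficients in the late period.
node_clustering = nx.clustering(coauthor_graph_late)    # dict: node -> clustering coefficient
plt.hist(node_clustering.values(), bins=20)
plt.xlabel('Clustering coefficient')
plt.ylabel('Number of authors')
plt.show()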

Finally, assortativity gives us an indication of how actors organize themselves within the network. High assortativity means that nodes with a high number of neighbors tend to be connected to other well-connected nodes.


In [81]:
print '1920-1944:', nx.degree_assortativity_coefficient(coauthor_graph_early)
print '1945-1969:', nx.degree_assortativity_coefficient(coauthor_graph_mid)
print '1970-1994:', nx.degree_assortativity_coefficient(coauthor_graph_late)


1920-1944: 0.0859861187818
1945-1969: 0.206739881331
1970-1994: 0.620196244596

The amount of assortativity also seems to be increasing over time.

GraphCollections

Often we're interested in looking at the evolution of a network over time, and it may not be appropriate or desirable to delimit time periods ahead of time. GraphCollections give us a nice way to look at networks over continuous time. Essentially, a GraphCollection uses a sliding time-window to subdivide our corpus and generates a graph from each of those subcorpora (much like we did above). It also indexes the nodes across the graphs, so that we can follow individual nodes over time.


In [5]:
from tethne import GraphCollection

To create the GraphCollection, we provide our Corpus, our network-building method (coauthors in this case), and then some optional parameters. slice_kwargs tells the GraphCollection how to generate the sliding time-window. In this case, we want periods of 6 years, and we want to advance the window 4 years at every step. So each network will represent 6 years of publications, and overlap with the next network by 2 years.


In [71]:
G = GraphCollection(corpus, coauthors, slice_kwargs={'window_size': 6, 'step_size': 4})

Here we can see that the number of nodes increases in what looks like an exponential fashion, outpacing the number of papers in each slice. This is consistent with our earlier observation that the average number of authors per paper is increasing: each paper contributes more authors to the network.


In [76]:
paper_distribution = corpus.distribution(window_size=6, step_size=4)
plt.plot(*paper_distribution, label='Papers')

node_distribution = zip(*G.node_distribution().items())
plt.plot(*node_distribution, label='Nodes')

plt.legend()
plt.show()


Here we can look explicitly at the relation between the average number of authors per paper and the number of nodes in each slice. They are related in a fairly linear fashion:


In [78]:
mean_authors = zip(*[(k, mean([len(authors) for authors in sub.features['authors'].features.values()]))
                    for k, sub in corpus.slice(window_size=6, step_size=4)])

In [79]:
plt.scatter(mean_authors[1], node_distribution[1])


Out[79]:
<matplotlib.collections.PathCollection at 0x15e26dfd0>

Here we calculate the mean degree of each graph in the GraphCollection:


In [94]:
mean_degree = zip(*[(k, mean(nx.degree(g).values())) for k, g in G.items()])
plt.plot(*mean_degree, lw=2)
plt.ylabel('Mean degree (blue)')

# Plot number of nodes on a separate axis.
ax = plt.gca()
ax2 = ax.twinx()
ax2.plot(*node_distribution, c='g')
plt.ylabel('Number of nodes (green)')

plt.show()


Interestingly, even though the number of nodes is always increasing, there is a dip in the mean degree around the mid-1940s, followed by a period of more rapid increase.

We can use the GraphCollection's analyze() method to apply one of the algorithms that we used above over time.


In [106]:
clustering = zip(*G.analyze('average_clustering').items())
plt.plot(*clustering, c='orange', lw=2)
plt.ylabel('Average clustering coefficient (orange)')

ax = plt.gca()
ax2 = ax.twinx()
ax2.plot(*mean_degree)
plt.ylabel('Mean degree (blue)')
plt.show()


We see a similar pattern in the average clustering coefficient: the extent of clustering actually decreases at first, and then picks up around the middle of the 1940s.


In [111]:
mean_assortativity = zip(*G.analyze('degree_assortativity_coefficient').items())

plt.figure(figsize=(10, 7))

plt.plot(*mean_assortativity, lw=2, c='purple', label='Assortativity coefficient')
plt.plot(*clustering, c='orange', lw=1, label='Average clustering coefficient')
plt.legend(loc='best')

ax = plt.gca()
ax2 = ax.twinx()
ax2.plot(*mean_degree)
plt.ylabel('Mean degree (blue)')

plt.show()


There is an even more dramatic dip in assortativity in the 1940s as well.

Bibliographic coupling

Let's take a look at the content of the papers themselves, in terms of their bibliographic citations. In a bibliographic coupling network, two papers are linked if they cite one or more of the same references; the more references they share, the more strongly they are coupled.


In [6]:
from tethne import bibliographic_coupling

We'll examine the topology of the literature in each of the three periods defined earlier.


In [ ]:
graph_bc_early = bibliographic_coupling(corpus_early, node_attrs=['date', 'title'])
graph_bc_mid = bibliographic_coupling(corpus_mid, node_attrs=['date', 'title'])
graph_bc_late = bibliographic_coupling(corpus_late, node_attrs=['date', 'title'])

The code below works around a bug in the network-building process; the bug will be fixed in later versions of Tethne.


In [ ]:
for g in [graph_bc_early, graph_bc_mid, graph_bc_late]:
    g.name = ''

In [ ]:
write_graphml(graph_bc_early, '/Users/erickpeirson/Desktop/graph_bc_early.graphml')

In [ ]:
write_graphml(graph_bc_mid, '/Users/erickpeirson/Desktop/graph_bc_mid.graphml')

In [ ]:
write_graphml(graph_bc_late, '/Users/erickpeirson/Desktop/graph_bc_late.graphml')
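As with the co-authorship networks, comparing order and size across the three bibliographic coupling graphs gives a rough sense of how densely coupled the literature is in each period. A quick sketch (output not shown):


In [ ]:
# Order (nodes) and size (edges) of each bibliographic coupling graph.
for label, g in [('1920-1944', graph_bc_early),
                 ('1945-1969', graph_bc_mid),
                 ('1970-1994', graph_bc_late)]:
    print label, g.order(), g.size()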
